Optimal Mixture Models in IR
Abstract
We explore the use of Optimal Mixture Models to represent topics. We analyze two broad classes of mixture models: set-based and weighted. We provide an original proof that estimating set-based models is NP-hard and therefore not computationally feasible. We argue that weighted models are superior to set-based models and that their solution can be estimated by a simple gradient descent technique. We demonstrate that Optimal Mixture Models can be successfully applied to the task of document retrieval. Our experiments show that weighted mixtures outperform a simple language modeling baseline. We also observe that weighted mixtures are more robust than other approaches to estimating topical models.
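The abstract does not state the objective or the parameterization behind the weighted mixtures, so the following is only a minimal sketch under stated assumptions: the topic is represented by word counts from a sample, the mixture is a convex combination of fixed component language models, and the weights are fitted by gradient descent on the negative log-likelihood of the sample. The softmax parameterization (used here to keep the weights non-negative and summing to one) and all names such as `fit_mixture_weights` are illustrative choices, not taken from the paper.

```python
import math


def softmax(theta):
    """Map unconstrained parameters to weights on the probability simplex."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def neg_log_likelihood(weights, components, counts):
    """Negative log-likelihood of the word counts under the weighted mixture."""
    nll = 0.0
    for v, c in enumerate(counts):
        if c == 0:
            continue
        mix = sum(w * comp[v] for w, comp in zip(weights, components))
        nll -= c * math.log(mix)
    return nll


def fit_mixture_weights(components, counts, steps=500, lr=0.5):
    """Gradient descent on the per-word negative log-likelihood.

    Assumption: every word with a non-zero count has non-zero probability
    under at least one component (e.g. the components are smoothed).
    """
    k = len(components)
    n = sum(counts)
    theta = [0.0] * k  # theta = 0 corresponds to uniform weights
    for _ in range(steps):
        w = softmax(theta)
        # g[j] = sum_v counts[v] * p_j(v) / mixture(v)
        g = [0.0] * k
        for v, c in enumerate(counts):
            if c == 0:
                continue
            mix = sum(w[j] * components[j][v] for j in range(k))
            for j in range(k):
                g[j] += c * components[j][v] / mix
        # d(NLL / n) / d(theta_j) = -w_j * (g_j - n) / n, so step the other way
        theta = [t + lr * w[j] * (g[j] - n) / n for j, t in enumerate(theta)]
    return softmax(theta)


# Hypothetical toy data: three component models over a three-word vocabulary.
components = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [1 / 3, 1 / 3, 1 / 3],
]
counts = [6, 2, 2]  # word counts observed in the topic sample
weights = fit_mixture_weights(components, counts)
```

In this toy setting the first component resembles the sample most closely, so its fitted weight ends up larger than that of the second component. The step count and learning rate are arbitrary and would need tuning on real data.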
Similar resources
Learning to Rank Documents for Ad-Hoc Retrieval with Regularized Models
In language modeling (LM) approaches for information retrieval (IR), the estimation of the document model is critical for retrieval effectiveness. Recent studies have proven that mixture models combining multiple resources can improve the accuracy of the estimation. This raises the problem of how to estimate the mixture weights in the model. In most previous studies, the mixture weights are assign...
Optimal transport for Gaussian mixture models
We present an optimal mass transport framework on the space of Gaussian mixture models, which are widely used in statistical inference. Our method leads to a natural way to compare, interpolate and average Gaussian mixture models. Basically, we study such models on a certain submanifold of probability densities with certain structure. Different aspects of this framework are discussed and severa...
Multivariate Density Estimation: a Support Vector Machine Approach
A Support Vector Machine (SVM) algorithm for multivariate density estimation is developed based on regularization principles and bounds on the convergence of empirical distribution functions. The algorithm is compared to Gaussian Mixture Models (GMMs). Our algorithm outperforms GMMs for data drawn from mixtures of Gaussians in ℝ² and ℝ⁶. Our algorithm is also automated with respect to param...
Using large clinical corpora for query expansion in text-based cohort identification
In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inher...
Image Segmentation using Gaussian Mixture Model
Stochastic models such as mixture models, graphical models, Markov random fields and hidden Markov models play a key role in probabilistic data analysis. In this paper, we apply a Gaussian mixture model to the pixels of an image. The parameters of the model were estimated by the EM algorithm. In addition, the label corresponding to each pixel of the true image was assigned by Bayes' rule. In fact,...